International Journal of Engineering and Technical Research (IJETR) 
ISSN: 2321-0869 (O) 2454-4698 (P) Volume-7, Issue-10, October 2017 


A Novel Approach for Multiple Vertical 
Fragmentations of Datasets Using Affinity Matrix in 

Distributed Environment 

Sunil Kumar Verma, Dr. Neelendra Badal 


Abstract — Distribution design involves making decisions on 
the fragmentation and allocation of data across the sites of a 
computer network. Multiple vertical splitting is the process of 
subdividing the attributes of a relation to generate fragments. In 
this paper, we propose an analysis for multiple vertical splitting 
algorithm using Slop Based Partitioning Algorithm (SBPA). 
This approach starts from the attribute affinity matrix and 
generates initial clusters based on the affinity values between 
attributes. Then, it uses the database according to optimal 
splitting solution to produce final groups that will represent the 
fragments. Then we analyzed these fragments according to their 
contribution level. The result is generated that shows how to 
find optimal solutions. 

Index Terms — BEA, Multiple Vertical Fragmentation, 
Clustered Affinity Matrix, Attribute Usage Matrix, Frequency 
Matrix, Distributed Database System, Slop Based Partitioning 
Algorithm. 

I. INTRODUCTION 

In distributed computing environments, each unit of data 
(item) which is accessed at the station, (site) is not usually a 
relationship but part of the relationship. Therefore, to 
optimize the performance of the query, the relations of global 
schema are fragmented into items. 

The primary concern of distributed database systems is to 
design the fragmentation and allocation of the underlying 
database. The distribution design involves making decisions 
on the fragmentation and placement of data across the sites of 
a computer network. The first phase of the distribution design 
in a top-down approach is the fragmentation phase, which is 
the process of clustering into fragments the information 
accessed simultaneously by applications. The fragmentation 
phase is then followed by the allocation phase, which handles 
the physical storage of the generated fragments among the 
nodes of a computer network, and the replication of 
fragments. 

A Distributed database is a database that is under the control 
of a central database management system (DBMS) in which 
storage devices are not all attached to a common CPU. It may 
be stored in multiple computer located in the same physical 
location, or may be dispersed over a network of 
interconnected computers. There are multiple sites 
(computers) in a distributed database so if one site fails then 
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system will not be useless, because other sites can do their job 
because same copy of data is installed on every location 

II. FRAGMENTATION 

Fragmentation is a design technique to divide a single relation 
or class of a database into two or more Partitions such that the 
Combination of the partitions provides the original database 
without any loss of information .This reduces the amount of 
irrelevant data accessed by the applications of the database, 
thus reducing the number of disk accesses. 

Further a new algorithm is being described for fragmentation 
of clusters of attributes of database table. Vertical partitioning 
is used during design of a database to enhance the 
performance of query execution. In order to obtain improved 
performance, fragments must closely match the requirements 
of the query workload. Vertical partitioning involves splitting 
the relations along the columns into partitions all having equal 
number of rows. Each partition now acts as a separate relation 
but we preserve the row ordering in all the partitions as it was 
in the original relation. It should be noted that each partition 
may contain more than one column. However, when columns 
in multiple partitions are accessed instead of join we just need 
to do pasting of columns. The advantages of vertical 
partitioning are as follows: If query involves only few columns 
then we avoid unnecessary fetching of other columns. This 
saves the I/O bandwidth and avoids unnecessary processing. 
Moreover data in a column belongs to the same domain e.g., 
values in salary column will be numeric within some range. 
This similarity in data can be well exploited by compression 
algorithms and better compression ratios can be achieved. 

III. MULTIPLE VERTICAL FRAGMENTATIONS 

Multiple Vertical fragmentation is the collective decay 
properties of the relational schema R into the sub schema Rl, 
R2,..,Rm, such that each attribute in these sub schemas is 
often accessed together. 

To show how often the same queries together, Hoffer and 
Severance introduced the concept attribute affinity [11]. 

If Q = {ql, q2, ..,qm} is a set of applications, R(A1, A2, .., 
An) is a relational schemas. The relationship between qi and 
attributes Ajis determined by using the values: 

1 , Aj is 07\ gaged in | 

0 , Aj is not engaged in 

Put (Ai, Aj) = {qeQ I use(q, Ai). use(q, Aj) = 1}. Attribute 
affinity between Ai and Ajis: 

AffUi'Aj) = y ™/[ (g) 11 MC; (qj) 
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In particular, refl(q): the number of pairs of attributes (Ai,Aj) 
is referenced in the application q at station SI; accl(q): 
frequency of access to applications q in station SI. BEA 
algorithm consists of two main phases: 

1) Permutations row, column affinity matrix of attribute to 
obtain the cluster affinity matrix (CA) which has global 
affinity measure AM (global affinity measure) [1] is the 
largest. 

2) Find the partition of the set of attributes from the matrix CA 
by exhaustive method, so that: 

Z= CTQ *CBQ - COQ2 is the maxima, with: 

CTQ = ^ rsf; (tjj )acc j (q i J) 

q±7 Q vSj 

COQ = ^ rsfj (qj~)accj (<?;)) 
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In which, 

AQiq{) = A|) = l}: 

TQ = 

BQ = IqMQiqd E BA}; 

OQ = Q\{TQ U BQ} 

The complexity of the algorithm is proportional to n2. 

IV. FRAGMENTATION PROCEDURE 

In this section SB PA, used for Vertical Partitioning of 
relation, is discussed in detail. Firstly using the AUM and FM, 
Clustered Affinity Matrix (CAM) is calculated. After 
calculation of CAM, SBPA is used to make the fragments of 
the relation. 

A. ATTRIBUTE USAGE MATRIX 

The Attribute Usage Matrix is used to show the attributes of 
relation used by a query. For each query QI and each attribute 
AJ, an Attribute Usage Value 0 or 1 is associated in AUM. 
The associated value is 1 if the attribute AJ is used by query 
QI otherwise the value associated is 0. 

r % f 1 / if Attribute A f is used by Query QA 

™ Wi - i } ~h f Otherwise " J 

Each row of AUM refers the attributes used by the 
corresponding query. The “1” entry in a column indicates that 
the query “uses” the corresponding attribute. 


Consider the following 4 queries for relation PROJ 

ql: SELECT BUDGET q2 : SELECT PNAME,BUDGET 
FROM PROJ FROM PROJ 

WHERE PNO=Value 

q3: SELECT PNAME q\\ SELECT SUM (BUDGET) 
FROM PROJ FROM PROJ 

WHERE LOC=Value WHERE LOC=Value 

Let Al= PNO, A2= PNAME, A3= BUDGET, A4= LOC 

Table 1 is an example of Attribute Usage Matrix in this paper. 


Table 1: Attribute Usage Matrix 


Query 

Attribute 

AI 

A2 

A3 

A4 

Ql 

1 

0 

1 

0 

Q2 

0 

1 

1 

0 

Q3 

0 

1 

0 

1 

Q4 

1 

0 

0 

1 


B. FREQUENCY MATRIX 

The frequency matrix represents the number of time a query is 
fired from one or more sites. Table2 is an example of the 
Frequency Matrix used in this paper. 


Table 2: Frequency Matrix 


Query 

Site 

SI 

S2 

S3 

Ql 

10 

12 

15 

Q2 

7 

0 

0 

Q3 

30 

25 

20 

Q4 

5 

0 

0 


C. ATTRIBUTE AFFINITY MATRIX 

Attribute affinity value measures the strength of an imaginary 
bond between the two attributes. It is predicated on the fact 
that attributes are used together by the query. Attribute 
affinity value represents the number of times two attributes 
are accessed together at all sites. 

Attribute affinity value between two attributes AI and AJ of a 
relation R[Al,A2...AN]with respect to the set of queries 
Q={Q1, Q2...QQ}is defined as follows. 

Attribute affinity value between AI and AJ is represented as 
aff (AI, AJ). 

Aff(A J ..Aj'j = 'V QtieryA 

£-^Aii fiisriss tkut uttsss AS ziid Aj 

Where, 

Query 

iTi j? jr UCCSBS 

3LCCQSS=^ a[ i sit& oecess frequency of a query* 

Query access (QI) = 37 

Query access (Q2) = 7 

Query access (Q3) = 15 

Query access (Q4) = 5 

aff (A1,A3) =£Q1 query access=37 

In the same way the whole Attribute affinity value is 

calculated. 
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Table 3: Attribute Affinity Matrix 


Attribute 

Attribute 

Al 

A2 

A3 

A4 

Al 

42 

0 

37 

5 

A2 

0 

82 

7 

75 

A3 

37 

7 

44 

0 

A4 

5 

75 

0 

8 


D. CLUSTERED AFFINITY MATRIX 

For the fragmentation of attributes in a relation, firstly 
attributes must be clustered. Clustering problem is widely 
researched in databases, data mining and statistics 
communities [8], [9], [10], [11], [12], [13]. Hoffer and 
Severance in [2] has suggested that the Bond Energy 
Algorithm (BEA) should be used for this purpose. The Bond 
Energy Algorithm takes Attribute Affinity Matrix as input, 
changes the order of its rows and columns, and produces a 
Clustered Affinity Matrix (CAM). Bond Energy Algorithm 
makes the cluster of those attributes which have high Attribute 
affinity value. 

Placement of attributes in CAM 
Placement of A1 and A2: 

In the initialization step first and second columns of AAM is 
placed to the first and second column of CAM respectively. 
Attribute A1 is placed at position 1 in CAM: [Al] 

Attribute A2 is placed at position 2 in CAM: [Al, A2] 

Placement of A3: 

Contribution of attribute A3 at position 1 in CAM= 6364 
Contribution of attribute A3 at position 2 in CAM = 6860 
Contribution of attribute A3 at position 3 in CAM = 1764 
Attribute A3 is placed at position 2 in CAM: [Al, A3, A2] 

Placement of A4: 

Contribution of attribute A4 at position 1 in CAM = 1220 
Contribution of attribute A4 at position 2 in CAM = - 3724 
Contribution of attribute A4 at position 3 in CAM = 23956 
Contribution of attribute A4 at position 4 in CAM =24300 
Attribute A4 is placed at position 4 in CAM: [Al, A3, A2, 
A4] 

Hence in Clustered Affinity Matrix, the order of placing the 
attributes in rows and columns are given below: 

[Al, A3, A2, A4] 


Table 4 Clustered Affinity Matrix 


Attribute 

Attribute 

Al 

A3 

A2 

A4 

Al 

42 

37 

0 

5 

A3 

37 

44 

7 

0 

A2 

0 

7 

82 

75 

A4 

5 

0 

75 

80 


AM — ^ (affinttyofA;andAjWitkt fcg irneighb crsj 


i j 

• The bond energy algorithm (BEA) was developed and 

has been used in the database design area to 
determine how to group data and how to physically 
place data on a disk. 

• It can be used to cluster attributes based on usage and 

then perform logical or physical design accordingly. 
With BEA, the affinity (bond) between database 
attributes is based on common usage. 

• This bond is used by the clustering algorithm as a 

similarity measure. The actual measure counts the 
number of times the two attributes are used together 
in a given time. To find this, all common queries 
must be identified. 

• The idea is that attributes that are used together form 

a cluster and should be stored together. In a 
distributed database, each resulting cluster is called a 
vertical fragment and may be stored at different sites 
from other fragments. 

• The basic steps of this clustering algorithm are: 

i. Create an attribute affinity matrix in which 

each entry indicates the affinity between the 
two associate attributes. The entries in the 
similarity matrix are based on the 
frequency of common usage of attribute 
pairs. 

ii. The BEA then converts this similarity matrix 

to a BOND matrix in which the entries 
represent a type of nearest neighbor 
bonding based on probability of co-access. 
The BEA algorithm rearranges rows or 
columns so that similar attributes appear 
close together in the matrix. 

iii. Finally, the designer draws boxes around 

regions in the matrix with high similarity. 
The resulting matrix, modified from, is illustrated in Figure 1. 
The two shaded boxes represent the attributes that have been 
grouped together into two clusters. 


A| A 2 Aj A 4 A w ^ | A„ 

A, 

A 2 
A, 

A 4 


A* . | 

A„ 

Figure 1: Clustered Affinity Matrix for BEA 



V. BOND ENERGY ALGORITHM 

Bond Energy Algorithm (BEA) has been usedfor clustering of 
entities. BEA finds anordering of entities (in our case 
attributes)such that the global affinity measure ismaximized. 


Two attributes Ai and Aj have a high affinity if they are 
frequently used together in database applications. At the heart 
of the BEA algorithm is the global affinity measure. Suppose 
that a database schema consists of n attributes {Al, A2,, An}. 
The global affinity measure, AM, is defined as 
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T2 

AW = ^ (bond. (A[, Ai_f) -fi- Acixd (A[, A[ + 1 ]) 
! = 1 


Vertical fragmentation based on knowledge- oriented 
clustering technique- 

To illustrate the vertical fragmentation algorithm based on 
knowledge-oriented clustering techniques. We use the 
assumption of examples about vertical fragmentation problem 
based on BEA algorithm is presented in [1], [8]: 

The set of attributes At= {Al, A2, A3, A4} 

The set of transactions Q = {ql, q2, q3, q4}. The matrix used: 


qi 

q^ 

q^ 

q4 


Ai 
' 1 
0 
0 
0 


A2 A3 A4 

0 1 if 

1 1 0 

1 0 1 

0 1 1 


The frequency of application execution with a set of 
transactions {ql, q2, q3, q4}, andF = {fl, f2, f3, f4} = {45, 5, 
75,3}. 

From the assumption, we have the reference feature vectors: 



qi 

q- 

q3 

q4 

VAi = 

45 

0 

0 

0 

va 2 = 

0 

5 

75 

0 

va 3 = 

45 

5 

0 

3 

VA4= 

0 

0 

75 

3 


The similar matrix S4x4= (s(Ak, Al)) k=l,4; 1=1,4 



Ai 

A 2 

A 3 

A 4 

Ai 

^ 1 

0 

0.9918 

0 

a 2 


1 

0.0073 

0.9970 

A 3 



1 

0.0026 

A 4 




1 


The result of the vertical fragmentation algorithm based on 
the clustering a lgorithm towards knowledge-o riented. 


Cluster 

Set of 
attribues 

1 

} 

2 



Split-point if slop diminishes between the se attributes very 
rapidly .The pseudo code for the SBPA is given below: 

Algorithm: SBPA 

Input: CAM: Clustered affinity matrix 
Output: F: set of four fragments 

Begin 

{Initialization of the variables} 


X [1,1.N]; 

Y [1,1.N]; 

Smallest=0; 

value 

Split-point=0; 


//used to store the value froml 
To No floopin corresponding index 
// used to store the value of slop 
// used to store the smallest slop 

// used to store the point from 


{Determine the Split-point} 


Do 

begin 

Fori =ltondo 

If (i==l) then 

Y[l,i]=CAM(l,i); 

Else 

Y[l, i]=CAM (l,i)-CAM (l,i-l); 
End-If 

X[l,i] =i; 

End-For Plot (X, Y); Smallest=Y [1,1]; Split-point=l; 

For i=2ton 

If (Smallest Y [l,i]then 

Split-point is recorded as X [l,i] Smallest=Y [1, i] 

End-If 
While [i=4] 

End-For 

End-Begin 

This above SBPA is divided into three steps 

• Initialization: In this step the user initializes the 
variables and array required by algorithm. 

• Processing: In the processing step ,first row of CAM 
is taken for fragmenting the clusters from a relation 
The user takes the difference of CAM(l,i) and 
CAM(l,i-l) and store it at Y[l, i]. 


Table 5: First Row of clustered Affinity Matrix 


Attributes 

Attributes 

Al 

A3 

A2 

A4 

Al 

42 

37 

0 

5 


This fragmentation results correlate with the results of the 
multiple vertical fragmentation by algorithm BEA. 

VI. PRAPOSED SBP AFGORITHAM 

The objective of Slop Based Partitioning Algorithm is to find 
a set of attributes that are frequently accessed by distinct set of 
queries. Using the Slop Based Partitioning Algorithm, the 
user makes the fragments of a relation on the basis CAM, 
which is calculated by AUM and FM. The first row of CAM is 
taken for fragmenting the clusters from a relation. The point 
between the neighbor attributes of the CAM is considered as 


• Comparison: In the last step the user finds the 
smallest value of Y [1, i] which represents the rapid 
diminishing of slop. The index i at which value of Y 
[1, i] is the smallest the corresponding value of X [1, 
i] is considered as Split-point. The following 
Calculation is performed with referenced to CAM. 

Y [1, 1] =CAM (1, 1) =42, X [1, 1] =1 

Y [1, 2] =CAM (1, 2)-CAM (1, 1) = 37-42= -5, X [1, 2] = 2 

Y [1, 3] =CAM (1, 3)-CAM (1, 2) = 0-37= -37, X [1, 3] =3 

Y [1, 4] = CAM (1, 4)-CAM (1, 3) =5-0 =5, X [1, 4] =4 

The Plot command in pseudo code plots the following graph. 

The graph shows the slop value Y [1, i] at point i. 
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Figure2: Graph between CAM [1, i] and X [1, i] 


The above graph shows the slop diminishes very rapidly 
between X [1, i] =2 and X [1, i] =3. So the Split-point is 
recorded between second and third attribute of CAM. So the 
above Clustered Affinity Matrix can be divided into two 
fragments. One fragment contains the attribute {Al, A3} and 
second fragment contains the attributes {A2, A4}. 

So for the fragmentation of relation R [Al, A2, A3, A4] has 
done as below: 

[Al, A3] I [A2, A4] 


VII. RESULTS 

An experiment has been carried out to test the working of 
proposed Vertical Partitioning algorithm SBPA. It has been 
carried out on a system with core i3 processor, 3GB RAM, 
Matlab toolbox and MS Access database. A relation with 
name Project has been used for partitioning. The Project 
relation has been stored in MS Access database as following: 


Table 5:Input Project 


PNo 

PName 

Budget 

Location 

PI 

Instrumentation 

150000 

Montreal 

P2 

Database Develop 

135000 

New York 

P3 

CAM/CAD 

250000 

New York 

P4 

Maintenance 

310000 

Paris 


The Project relation has tested against set of four queries Ql, 
Q2, Q3 and Q4 generated from any of the three sites named 
SI, S2 and S3. 

Ql: Find the Budget from the Project where given its 
identification number. 

(SELECT BUDGET, FROM PROJECT, WHERE 
PNO=Value) 

Q2: Find the Name and Budget of all Projects. 

(SELECT PNAME, BUDGET FROM PROJECT, WHERE 
LOC ATION=V alue) 

Q3: Find the Name of projects located at given city. 
(SELECT PNAME, FROM PROJECT, WHERE 
LOCATION=V alue) 

Q4: Find the PNo and total project Budget for each city. 
(SELECT PNo, SUM (BUDGET), FROM PROJECT, 
WHERE LOCATION^Value) 

The Attribute Usage Matrix of the above queries set is as 
following. 


Table 7:Attribute Usage Matrix Project 


Query 

Attribute 

PNo 

PNAME 

BUDGET 

LOCATION 

Ql 

1 

0 

1 

0 

Q2 

0 

1 

1 

1 

Q3 

0 

1 

0 

1 

Q4 

0 

0 

1 

1 


The frequency of queries Ql, Q2, Q3and Q4 at three sites has 
considered as following: 


Table 8: Frequency Matrix Project 


Query 

Site 

SI 

S2 

S3 

Ql 

20 

25 

10 

Q2 

5 

2 

0 

Q3 

16 

18 

30 

Q4 

3 

2 

1 


Using the Bond Energy Algorithm proposed by Hoffer and 
severance in [2], Clustered Affinity Matrix is calculated from 
Attribute Usage Matrix Project in Table 7 and Frequency 
Matrix Project in Table 8. 


Table 9:Clustered Affinity Matrix Project 


Attribute 

Attribute 

PN 

0 

BUDG 

ET 

PNAM 

E 

LOCATI 

ON 

PNo 

61 

61 

6 

0 

BUDGET 

61 

68 

13 

7 

PNAME 

6 

13 

77 

71 

LOCATI 

ON 

0 

7 

71 

71 


After calculating the Clustered Affinity Matrix, relation 
project in Table 6 has been partitioned into two fragments 
using the SBPA. One fragment named as Projectl has 
attributes PName and Location while other fragment named 
as Project2 has attributes PNo and Budget as followed. So the 
relation project having four attributes PNo, PName, Budget, 
Location is partitioned into two fragments for the above given 
Attribute Usage Matrix in Table 7 and Frequency Matrix in 
Table 8 respectively. 


Table 10: Project 1 


PName 

Location 

Instrumentation 

Montreal 

Database Develop 

New York 

CAM/CAD 

New York 

Maintenance 

Paris 


Table 11: Project 2 


PNo 

Budget 

PI 

150000 

P2 

135000 

P3 

250000 

P4 

310000 
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VIII. Conclusion 

ABOUT THE AUTHOR 

The purpose of conducting this study is to know the 
comparative study of various vertical splitting techniques. We 
know in centralized databases multiple vertical 
fragmentations is widely used to do the fragments, so our 
study presents a view for ease of use of multiple vertical 
fragmentation. 

Distributed databases have many aspects and every 
organization has certain preferences. For the telecom sector, 
the response time is prioritized. 

Our experiment showed that the how to calculate the 
fragments for different criteria used. In the distributed 
database, data is fragmented. These fragments are short 
compared to the full database (centralized database contains 
maximum columns). However, when we need data from 
multiple sites for a query (report queries), the response time is 
increased. Accessing data from multiple remote sites and then 
joining those takes long time. But in the centralized database 
since data is at one place so, it is easy and fast to search it. 

Multiple Vertical Partitioning algorithms SBPA has 
presented and successfully implemented for improving the 
Query-Response-Time in Distributed Database System. The 
proposed algorithm SBPA has used CAM. In the first phase, 

CAM is calculated from AUM and FM. In the second phase, 
the two fragments of the relation are created by CAM Using 
SBPA. 
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